Spatial analysis is a process that begins with exploring and mapping a dataset and can lead to potentially complex models and visualizations of real world features and phenomena. Spatial queries are the building blocks of this analytical process. These queries are software operations that allow us to ask questions of our data and which return data metrics, subsets or new data objects. In this lesson we explore the two basic types of spatial queries: measurement queries and relationship queries.
Instructor Notes
The basic types of spatial queries are:
Both of these types of queries operate on the geometry of features in one or two datasets and are dependent on the type of geometry. For example, with point features you can make distance measurements or ask what points are spatially inside polygon objects. Polygon features, on the other hand, allow for a wider range of both measurement and spatial relationship queries.
An important distinction between these two types of queries is that measurement queries depend on the CRS of the data while spatial relationship queries do not. This is because topological relationships, the term used to describe spatial relationships, are invariant to rotation, translation and scaling transformations like those that CRS transformations entail.
In this lesson we’ll work through examples of each of those types of queries.
Then we’ll try a very common spatial analysis method that is a conceptual amalgam of those two types: proximity analysis.
Load the libraries we will use.
library(sf)
library(tmap)
Read in the CA census tracts data and then take a look at its geometry and attributes.
census_tracts = st_read("notebook_data/census/Tracts/cb_2013_06_tract_500k.shp")
## Reading layer `cb_2013_06_tract_500k' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/census/Tracts/cb_2013_06_tract_500k.shp' using driver `ESRI Shapefile'
## Simple feature collection with 8043 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.4096 ymin: 32.53416 xmax: -114.1312 ymax: 42.00952
## Geodetic CRS: NAD83
plot(census_tracts$geometry)
head(census_tracts)
## Simple feature collection with 6 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.3049 ymin: 37.76456 xmax: -122.2074 ymax: 37.84851
## Geodetic CRS: NAD83
## STATEFP COUNTYFP TRACTCE AFFGEOID GEOID NAME LSAD ALAND
## 1 06 001 400300 1400000US06001400300 06001400300 4003 CT 1105329
## 2 06 001 400900 1400000US06001400900 06001400900 4009 CT 420877
## 3 06 001 402200 1400000US06001402200 06001402200 4022 CT 712082
## 4 06 001 402800 1400000US06001402800 06001402800 4028 CT 398311
## 5 06 001 404800 1400000US06001404800 06001404800 4048 CT 628405
## 6 06 001 406100 1400000US06001406100 06001406100 4061 CT 1843685
## AWATER geometry
## 1 0 MULTIPOLYGON (((-122.2642 3...
## 2 0 MULTIPOLYGON (((-122.2856 3...
## 3 0 MULTIPOLYGON (((-122.304 37...
## 4 0 MULTIPOLYGON (((-122.276 37...
## 5 0 MULTIPOLYGON (((-122.2182 3...
## 6 74875 MULTIPOLYGON (((-122.2387 3...
Select just the Alameda County census tracts.
census_tracts_ac = census_tracts[census_tracts$COUNTYFP=='001',]
plot(census_tracts_ac)
We’ll start off with some simple measurement queries.
We can get the area of each of our census tracts using thesf function st_area.
st_area(census_tracts_ac)[1:10]
## Units: [m^2]
## [1] 1106564.4 435823.9 693541.5 400641.5 618815.4 1962114.5 370336.5
## [8] 478238.0 587716.8 857795.3
Okay!
We got…
numbers!
…?
Question
Let’s take a look at our CRS.
st_crs(census_tracts_ac)
## Coordinate Reference System:
## User input: NAD83
## wkt:
## GEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]]
Wow! We’re working with data that are in what is called an unprojected CRS. That means that the coordinates are latitude and longitude values and the units are decimal degrees. However, the sf::st_area function automatically returned area measurements in square meters (rather than in square degrees, which don’t make sense.)
How did it do this? Well, you can check out the help documentation for ?st_area for more information. If the data have a projected CRS, st_area uses Euclidean geometry to return area measurements in the units of the CRS. For an unprojected CRS, st_area calculates geodetic area on a curved surface model of the Earth and returns measurements in square meters. Pretty cool and pretty useful right?
That said, when doing spatial analysis, we will almost always want to convert all of our data to the same projected CRS since many spatial operations do not work with geographic CRSs.
Time to project! We’ll transform the data to the UTM Zone 10N, NAD83 CRS (EPSG 26910) which is appropriate for Northern California location data and is highly accurate for measurement queries for areas within the zone.
#Transform CRS of census tract data
census_tracts_ac_utm10 = st_transform(census_tracts_ac, 26910)
Now check it..especially look for the units.
st_crs(census_tracts_ac_utm10)
## Coordinate Reference System:
## User input: EPSG:26910
## wkt:
## PROJCRS["NAD83 / UTM zone 10N",
## BASEGEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]],
## CONVERSION["UTM zone 10N",
## METHOD["Transverse Mercator",
## ID["EPSG",9807]],
## PARAMETER["Latitude of natural origin",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8801]],
## PARAMETER["Longitude of natural origin",-123,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8802]],
## PARAMETER["Scale factor at natural origin",0.9996,
## SCALEUNIT["unity",1],
## ID["EPSG",8805]],
## PARAMETER["False easting",500000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8806]],
## PARAMETER["False northing",0,
## LENGTHUNIT["metre",1],
## ID["EPSG",8807]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["unknown"],
## AREA["North America - 126°W to 120°W and NAD83 by country"],
## BBOX[30.54,-126,81.8,-119.99]],
## ID["EPSG",26910]]
Now let’s try our area calculation again.
st_area(census_tracts_ac_utm10)[1:10]
## Units: [m^2]
## [1] 1105796.6 435518.4 693052.3 400361.5 618393.6 1960768.5 370083.9
## [8] 477916.4 587326.6 857191.6
What if we compare areas calculated with our unprojected and projected CRS data?
# Using format to make for easier to read display
print(format(st_area(census_tracts_ac)[[1]], big.mark=','))
## [1] "1,106,564 [m^2]"
print(format(st_area(census_tracts_ac_utm10)[[1]], big.mark=","))
## [1] "1,105,797 [m^2]"
Hmmm… The numbers are a bit different…specifically…
format((st_area(census_tracts_ac)[[1]] - st_area(census_tracts_ac_utm10)[[1]]),digits=0, big.mark=',')
## [1] "768 [m^2]"
You may have noticed that our census tracts already have an area column in them.
Let’s also compare the calculated areas with the data in this column.
print(st_area(census_tracts_ac)[[1]])
## 1106564 [m^2]
print(st_area(census_tracts_ac_utm10)[[1]])
## 1105797 [m^2]
print(census_tracts$ALAND[1])
## [1] 1105329
Question
What explains the discrepancy? Which areas are correct? Which are incorrect?
We can also calculate the area for Alameda county summing the areas of all census tracts.
sum(st_area(census_tracts_ac_utm10))
## 1948917581 [m^2]
We can look up how large Alameda County is to check our work.The county is 739 miles2, which is around 1,914,001,213 meters2. I’d say we’re pretty close!
# Sum the area of all Alameda county Census Tracts dynamically in square miles...
sum(units::set_units(st_area(census_tracts_ac_utm10),mi^2))
## 752.4813 [mi^2]
Calculating the area of all features and adding the output to the dataframe is a useful operation because it allows us to convert count variables like population to densities.
# Add the area of all Alameda County Census tracts to the data frame
census_tracts_ac_utm10$area_sqmi <-units::set_units(st_area(census_tracts_ac_utm10), mi^2)
# Check it by summing
print(sum(census_tracts_ac_utm10$area_sqmi))
## 752.4813 [mi^2]
# Take a look
head(census_tracts_ac_utm10,3)
## Simple feature collection with 3 features and 10 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 561190.1 ymin: 4184071 xmax: 566536.6 ymax: 4189277
## Projected CRS: NAD83 / UTM zone 10N
## STATEFP COUNTYFP TRACTCE AFFGEOID GEOID NAME LSAD ALAND
## 1 06 001 400300 1400000US06001400300 06001400300 4003 CT 1105329
## 2 06 001 400900 1400000US06001400900 06001400900 4009 CT 420877
## 3 06 001 402200 1400000US06001402200 06001402200 4022 CT 712082
## AWATER geometry area_sqmi
## 1 0 MULTIPOLYGON (((564745 4188... 0.4269505 [mi^2]
## 2 0 MULTIPOLYGON (((562861.1 41... 0.1681546 [mi^2]
## 3 0 MULTIPOLYGON (((561264.5 41... 0.2675890 [mi^2]
You may be wondering how R is managing the units of our measurements.
It turns out that sf depends on the units package to track units.
This is super convenient! But there is a gotcha:
# convert to square kilometers
sum(st_area(census_tracts_ac_utm10)) / (1000^2)
## 1948.918 [m^2]
Oops! Our manual conversion to square kilometers gave us the right number but kept the now-wrong units!
Here’s the proper way to convert:
units::set_units(sum(st_area(census_tracts_ac_utm10)), km^2)
## 1948.918 [km^2]
Much nicer! In case you’re wondering how we knew the right abbreviation to use for kilometers, check out the leftmost column in this reference table:
# View(units::valid_udunits())
We can use the st_length operator in the same way to calculate the length or perimeter of features. Always take note of the output units!
st_length(census_tracts)[1:10]
## Units: [m]
## [1] 5358.919 2757.904 5397.799 2682.912 3711.654 6409.748 2540.302 3090.729
## [9] 4262.487 3936.085
The st_distance function can be used to find the pairwise distance between two sets of geometries.
st_distance(census_tracts_ac_utm10, census_tracts_ac_utm10)[1:5,1:5]
## Units: [m]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.0000 889.5047 3130.831 2845.976 4237.210
## [2,] 889.5047 0.0000 2252.841 2703.087 6149.383
## [3,] 3130.8314 2252.8406 0.000 1315.946 6378.630
## [4,] 2845.9762 2703.0874 1315.946 0.000 4433.345
## [5,] 4237.2105 6149.3828 6378.630 4433.345 0.000
You can also use it to find the distance between specific features.
# Identify my tracts of interest
mytracts = c('4201','4202','4203','4204')
# What is the distance between tract 4201 and all other tracts
st_distance(census_tracts_ac_utm10[census_tracts_ac_utm10$NAME=='4101',],
census_tracts_ac_utm10[census_tracts_ac_utm10$NAME %in% mytracts,] )
## Units: [m]
## [,1] [,2] [,3] [,4]
## [1,] 19220.14 18946.69 19263.81 18822.72
Spatial relationship queries consider how two geometries or sets of geometries relate to one another in space. For example, you may want to know what schools are located within the City of Berkeley or what East Bay Regional Parks have land within Berkeley. You may also want to combine a measurement query with a spatial relationship query. Example, you may want to know the total length of freeways within the city of Berkeley.
Here is a list of some of the more commonly used sf spatial relationship operations.
These can be used to select features in one dataset based on their spatial relationship to another. In other works, you can use these operations to make spatial selections / create spatial subsets.
Enough talk. Let’s work through some examples.
First, load the CA Places data and select the city of Berkeley and save it to a sf dataframe.
places = st_read('notebook_data/census/Places/cb_2018_06_place_500k.shp')
## Reading layer `cb_2018_06_place_500k' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/census/Places/cb_2018_06_place_500k.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1521 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.2694 ymin: 32.53416 xmax: -114.229 ymax: 41.99314
## Geodetic CRS: NAD83
berkeley = places[places$NAME=='Berkeley',]
plot(berkeley$geometry)
Then, load the Alameda County schools data and make it a spatial dataframe.
schools_df = read.csv('notebook_data/alco_schools.csv')
schools_sf = st_as_sf(schools_df,
coords = c('X','Y'),
crs = 4326)
Check that the two sf dataframes have the same CRS.
st_crs(schools_sf) == st_crs(berkeley)
## [1] FALSE
They don’t have the same CRS so we need to align them. Let’s transform (or reproject) the CRS of both of these dataframes to UTM10N, NAD83 (EPSG 26910). This is a commonly used CRS for Northern CA data.
# Transform data CRSs...
schools_utm10 <- st_transform(schools_sf, 26910)
berkeley_utm10 <- st_transform(berkeley, 26910)
If you look at the Schools data you will see that it has a City column. So we can subset the data by attribute to select the Schools in Berkeley. No need to do a spatial selection.
berkeley_schools = schools_utm10[schools_utm10$City=='Berkeley',]
dim(berkeley_schools)
## [1] 31 8
Confirm the results by plotting the data
plot(berkeley_utm10$geometry)
plot(berkeley_schools$geometry, add=T)
That looks good and was a relatively simple operation. But what if the schools data didn’t have that city column or if only some of the rows had a value in that column. How can we identify the schools in Berkeley spatially?
Here’s how!
# SPATIALLY select only the schools within Berkeley
berkeley_schools_spatial = schools_utm10[berkeley_utm10, , op=st_intersects] #NO QUOTES!!!
Yes that was it! Take a long look at that simple yet powerful spatial selection syntax.
You should interpret that syntax as:
"Select all features (i.e. rows) in the schools_utm10 dataframe:
and all of the columns:
(all because the extraction brackets have no second argument)
whose geometry spatially intersects the Berkeley_utm10 geometry
The op=st_intersects argument is optional because st_intersects is the default spatial selector.
To emphasize this, let’s rerun the last command
# SPATIALLY select only the schools within Berkeley
berkeley_schools_spatial = schools_utm10[berkeley_utm10, ]
spatiallly intersects mean?Here’s one way to explain it.
Geometry A spatially intersects Geometry B if any of its parts (e.g., a point, line segment, or polygon) is equivalent to, touches, crosses, is contained by, contains, or overlaps any part of Geometry B.
So st_intersects is the mother of all spatial relationships! It is the most general and the most useful. However, you can specify any of those more specific spatial relationships by setting op= to any of the options listed in the ?st_intersects? help documentation.
Let’s check out the sf object that our selection returned.
# How many schools did we get
dim(berkeley_schools_spatial)
## [1] 32 8
# Map the results
plot(berkeley_utm10$geometry)
plot(berkeley_schools_spatial$geometry, add=T)
Interestingly, we have one more school in Berkeley based on the spatial selection!? Let’s take a look.
plot(berkeley_utm10$geometry)
plot(berkeley_schools_spatial$geometry, add=T)
plot(berkeley_schools$geometry,col="red", add=T)
Let’s use an interactive tmap to zoom into the school that was not selected by attribute but was spatially selected.
tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(berkeley_utm10) +
tm_borders() +
tm_shape(berkeley_schools_spatial) +
tm_dots(col="black", size=.3) +
tm_shape(berkeley_schools) +
tm_dots(col="red", size=.1)
IMPORTANT: The default spatial selection operator is
st_intersects. If you want to use any other operator, it must be specified.
For example, we can use the st_disjoint operator to select only those schools NOT in Berkeley.
# Select all Alameda County Schools NOT in Berkeley with the disjoint operator
berkeley_schools_disjoint = schools_utm10[berkeley_utm10, , op=st_disjoint]
# Plot the result
plot(berkeley_schools_disjoint$geometry)
plot(berkeley_utm10, col=NA, border="red", add=T)
## Warning in plot.sf(berkeley_utm10, col = NA, border = "red", add = T): ignoring
## all but the first attribute
There is no need to memorize these spatial operators (aka predicates)! Here is a fantastic sf cheatsheet that lists and briefly explains all these common functions (and many more).
Let’s load a new dataset, the CA Protected Areas Database (CPAD), to demonstrate these spatial queries in more detail.
This dataset contains all of the protected areas (parks and the like) in California.
We will use this data and the Alameda County Census Tract Data that we created earlier to ask “What protected areas are within Alameda County?”
First load the CPAD data.
cpad = st_read('./notebook_data/protected_areas/CPAD_2020a_Units.shp')
## Reading layer `CPAD_2020a_Units' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/protected_areas/CPAD_2020a_Units.shp' using driver `ESRI Shapefile'
## Simple feature collection with 17068 features and 21 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -374984.2 ymin: -604454.8 xmax: 540016.3 ymax: 449743.2
## Projected CRS: NAD83 / California Albers
What is the CRS of the CPAD data?
Let’s transform the data to match census_tracts_ac_utm10.
cpad_utm10 = st_transform(cpad, st_crs(census_tracts_ac_utm10))
Let’s plot the data in so that we know what to expect. CPAD is big so wait for it…
plot(census_tracts_ac_utm10$geometry, col='grey', border="grey")
plot(cpad_utm10$geometry, col='green', add=T)
We can see from our map that some of the protected areas are completely within Alameda County, some of them overlap, and some are completely outside of the county. To get both of the “inside” and “overlaps” cases we use the st_intersects spatial selection operator, which is the default. Let’s check it out.
cpad_intersects = cpad_utm10[census_tracts_ac_utm10,] #st_intersects
cpad_within = cpad_utm10[census_tracts_ac_utm10, , op=st_within] #st_within
We can use tmap to explore the difference in the results from st_intersects vs st_within
tmap_mode('view')
## tmap mode set to interactive viewing
tm_shape(census_tracts_ac_utm10)+
tm_polygons(col="gray", border.col="grey") +
tm_shape(cpad_intersects) +
tm_borders(col="green") +
tm_shape(cpad_within) +
tm_borders(col='red')
What you can see from the above, is that by default, st_intersects returns the features that intersect but it does not clip the features to the boundary of Alameda County. For that, we would need to use a different spatial operation - st_intersection.
Let’s try it!
cpad_in_ac = st_intersection(cpad_utm10, census_tracts_ac_utm10)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
Great! Now, if we scroll the resulting sf object we’ll see that the COUNTY column of our resulting subset gives us a good sanity check on our results. Or does it?
table(cpad_in_ac$COUNTY)
##
## Alameda Contra Costa San Joaquin Santa Clara
## 839 20 2 4
Always check your output - both the attribute table & the geometry!
head(cpad_in_ac)
## Simple feature collection with 6 features and 31 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 561576 ymin: 4184117 xmax: 565367.9 ymax: 4188588
## Projected CRS: NAD83 / UTM zone 10N
## ACCESS_TYP UNIT_ID UNIT_NAME SUID_NMA AGNCY_ID
## 8478 Open Access 13443 Hardy Dog Park 19677 1228
## 11167 Open Access 47996 Redondo Park 32418 1228
## 4954 Open Access 14781 Temescal Creek Park 26230 1098
## 3585 Open Access 29165 Cypress Freeway Memorial Park 17873 1228
## 8131 Open Access 13451 South Prescott Park 25676 1228
## 9319 Open Access 15535 Mandela Parkway 21738 1228
## AGNCY_NAME AGNCY_LEV AGNCY_TYP
## 8478 Oakland, City of City City Agency
## 11167 Oakland, City of City City Agency
## 4954 Emeryville, City of City City Agency
## 3585 Oakland, City of City City Agency
## 8131 Oakland, City of City City Agency
## 9319 Oakland, City of City City Agency
## AGNCY_WEB LAYER MNG_AG_ID
## 8478 http://www2.oaklandnet.com/Government/o/opr/index.htm City 1228
## 11167 http://www2.oaklandnet.com/Government/o/opr/index.htm City 1228
## 4954 http://www.ci.emeryville.ca.us/158/City-Parks City 1098
## 3585 http://www2.oaklandnet.com/Government/o/opr/index.htm City 1228
## 8131 http://www2.oaklandnet.com/Government/o/opr/index.htm City 1228
## 9319 http://www2.oaklandnet.com/Government/o/opr/index.htm City 1228
## MNG_AGENCY MNG_AG_LEV MNG_AG_TYP PARK_URL COUNTY ACRES
## 8478 Oakland, City of City City Agency <NA> Alameda 0.788
## 11167 Oakland, City of City City Agency <NA> Alameda 0.250
## 4954 Emeryville, City of City City Agency <NA> Alameda 1.527
## 3585 Oakland, City of City City Agency <NA> Alameda 1.278
## 8131 Oakland, City of City City Agency <NA> Alameda 4.269
## 9319 Oakland, City of City City Agency <NA> Alameda 12.470
## LABEL_NAME YR_EST DES_TP GAP_STS
## 8478 Hardy Dog Park 0 Local Park 4
## 11167 Redondo Park 0 Local Park 4
## 4954 Temescal Creek Park 1972 Local Park 4
## 3585 Cypress Freeway Memorial Park 0 Local Park 4
## 8131 South Prescott Park 0 Local Park 4
## 9319 Mandela Parkway 0 Local Recreation Area 4
## STATEFP COUNTYFP TRACTCE AFFGEOID GEOID NAME LSAD
## 8478 06 001 400300 1400000US06001400300 06001400300 4003 CT
## 11167 06 001 400300 1400000US06001400300 06001400300 4003 CT
## 4954 06 001 400900 1400000US06001400900 06001400900 4009 CT
## 3585 06 001 402200 1400000US06001402200 06001402200 4022 CT
## 8131 06 001 402200 1400000US06001402200 06001402200 4022 CT
## 9319 06 001 402200 1400000US06001402200 06001402200 4022 CT
## ALAND AWATER area_sqmi geometry
## 8478 1105329 0 0.4269505 [mi^2] POLYGON ((565274.7 4188544,...
## 11167 1105329 0 0.4269505 [mi^2] POLYGON ((565005.3 4188113,...
## 4954 420877 0 0.1681546 [mi^2] POLYGON ((563475.6 4188002,...
## 3585 712082 0 0.2675890 [mi^2] POLYGON ((562225.1 4184994,...
## 8131 712082 0 0.2675890 [mi^2] POLYGON ((561576 4184223, 5...
## 9319 712082 0 0.2675890 [mi^2] POLYGON ((562327.9 4184999,...
Let’s also use an overlay plot to check the output geometry.
tm_shape(census_tracts_ac_utm10) +
tm_polygons(col='gray', border.col='gray') +
tm_shape(cpad_in_ac) +
tm_polygons(col = 'ACRES', palette = 'YlGn',
border.col = 'black', lwd = 0.4,
alpha = 0.8,
title = 'Protected areas in Alameda County, colored by area')
It really depends! But make sure you understand the difference.
st_intersects is a logical operator that returns True if two geometries intersect in any way. When used to subset (or filter) a spatial dataframe, st_intersects returns those features in the dataframe that intersect with the filter dataframe.
On the other hand, st_intersection returns a new spatial dataframe that set intersection of the two dataframes, including both the geometries and attributes of the intersecting features. Use st_intersection with caution and always check your results.
It’s your turn.
Write a spatial relationship query to create a new dataset containing only the BART stations in Berkeley.
Run the next two cells to (1) load the dataset containing Berkeley BART stations and then reproject it to the same CRS as that used by the Berkeley_utm10 dataframe (EPSG: 26910)’ (2) plot these two datasets in an overlay map.
Then, write your own code to: 1. Spatially select the BART stations that are within Berkeley 2. Plot the Berkeley boundary and then overlay the selected BART stations.
# load the Berkeley boundary
bart_df = st_read("notebook_data/transportation/bart.csv")
## Reading layer `bart' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/transportation/bart.csv' using driver `CSV'
## Warning: no simple feature geometries present: returning a data.frame or tbl_df
bart_sf = st_as_sf(bart_df,
coords = c('lon','lat'),
crs = 4326)
# transform to EPSG:26910
bart_utm10 = st_transform(bart_sf, st_crs(berkeley_utm10))
# display
head(berkeley_utm10)
## Simple feature collection with 1 feature and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 559347.6 ymin: 4188961 xmax: 567371.4 ymax: 4195617
## Projected CRS: NAD83 / UTM zone 10N
## STATEFP PLACEFP PLACENS AFFGEOID GEOID NAME LSAD ALAND
## 1213 06 06000 02409837 1600000US0606000 0606000 Berkeley 25 27127391
## AWATER geometry
## 1213 18715614 MULTIPOLYGON (((559347.6 41...
Plot the data together
plot(bart_utm10$geometry)
plot(berkeley_utm10$geometry, border='blue', add=T)
Now, in the cell below, write the code to spatially select the Berkeley BART stations, then make the map.
# YOUR CODE HERE:
# Spatially select the BART stations within Berkeley
# Plot the Bart stations in Berkeley overlaid on top of the Berkeley City Boundary
Now that we’ve seen the basic idea of spatial measurement and spatial relationship queries, let’s take a look at a common analysis that combines those concepts: promximity analysis.
Proximity analysis seeks to identify near features - or, in other words, all features in a focal feature set that are within some maximum distance of features in a reference feature set.
A very common workflow for this analysis is:
Buffer around the features in the reference dataset to create buffer polygons.
Run a spatial relationship query to find all focal features that intersect (or are within) the buffer polygons.
Let’s read in our bike boulevard data again.
Then we’ll find out which of our Berkeley schools are within a block’s distance (200 meters) of the bike boulevards.
bike_blvds = st_read('notebook_data/transportation/BerkeleyBikeBlvds.geojson')
## Reading layer `BerkeleyBikeBlvds' from data source `/Users/pattyf/Documents/Dlab/workshops/2021/Geospatial-Fundamentals-in-R-with-sf/notebook_data/transportation/BerkeleyBikeBlvds.geojson' using driver `GeoJSON'
## Simple feature collection with 211 features and 10 fields
## Geometry type: MULTILINESTRING
## Dimension: XY
## Bounding box: xmin: 561541.2 ymin: 4189007 xmax: 566451.6 ymax: 4193483
## Projected CRS: WGS 84 / UTM zone 10N
plot(bike_blvds$geometry)
Of course, we need to reproject the boulevards to our projected CRS.
bike_blvds_utm10 = st_transform(bike_blvds, st_crs(berkeley_utm10))
Now we can create our 200 meter bike boulevard buffers.
bike_blvds_buf = st_buffer(bike_blvds_utm10, dist=200)
Now let’s overlay everything.
tm_shape(berkeley_utm10) +
tm_polygons(col = 'lightgrey') +
tm_shape(bike_blvds_buf) +
tm_polygons(col = 'pink', alpha = 0.5) +
tm_shape(bike_blvds_utm10) +
tm_lines() +
tm_shape(berkeley_schools_spatial) +
tm_dots(col = 'purple', size=0.2)
Great! Looks like we’re all ready to run our spatial relationship query to complete the proximity analysis. At this point (pun intended) select the schools that are in within the bike boulevard buffer polygons.
schools_near_blvds = berkeley_schools_spatial[bike_blvds_buf,]
# or
#schools_near_blvds = berkeley_schools_spatial[bike_blvds_buf, , op='st_within']
Now let’s overlay again, to see if the schools we selected make sense.
tm_shape(berkeley_utm10) +
tm_polygons(col = 'lightgrey') +
# add the bike blvd buffer polygons
tm_shape(bike_blvds_buf) +
tm_polygons(col = 'pink', alpha = 0.5) +
# Add the bike blvd line features
tm_shape(bike_blvds_utm10) +
tm_lines() +
# Add all berkeley schools
tm_shape(berkeley_schools_spatial) +
tm_dots(col = 'purple', size=0.2) +
# Add schools near bike blvds in yellow
tm_shape(schools_near_blvds) +
tm_dots(col = 'yellow', size=0.2)
You can use st_distance and its companion function st_nearest_feature to compute the distance between each feature and the nearest bike boulevard. The st_nearest_feature function returns the ID of the closest feature.
# Identify the nearest bike boulevard for each school
nearest = st_nearest_feature(berkeley_schools_spatial,bike_blvds_utm10)
# take a look!
nearest
## [1] 71 171 39 136 42 30 163 172 127 171 189 156 137 33 187 197 50 207 169
## [20] 102 125 137 3 109 207 76 135 173 56 102 146 76
Then we can calculate the distance between each school and it’s nearest bike boulevard.
st_distance(berkeley_schools_spatial, bike_blvds_utm10[nearest,], by_element=TRUE)
## Units: [m]
## [1] 13.848161 985.459488 309.889446 369.402946 196.011379 15.332395
## [7] 27.250406 439.758905 107.902846 926.216792 193.030072 181.836256
## [13] 373.736477 215.903128 184.321307 186.446907 1.288383 94.064527
## [19] 211.477566 218.613628 186.913116 230.212129 15.162313 188.829602
## [25] 232.764113 224.700672 173.920971 15.892361 514.765384 92.556921
## [31] 93.426741 128.131187
Now it’s your turn to try out a proximity analysis!
Write your own code to find all BERKELEY schools within walking distance (1 km) of a BART station.
As a reminder, let’s break this into steps:
bart_utm10 dataframeTo see the solution, look at the hidden text below.
# YOUR CODE HERE:
# Spatially select the Berkeley Bart Stations
# You may have done this in a previous exercise.
# buffer the BART stations to 1 km
# spatially select the schools within the buffers
# Map your results with tmap
# plot the Berkeley boundary (for reference) in lightgrey
# add the BART stations (for reference) to the plot in green
# add the BART buffers (for check) in lightgreen
# add all Berkeley schools (for reference) in black
# add the schools near BART (for check) in yellow
Compute the distance between each Berkeley School and its nearest BART station!
#YOUR CODE HERE
Leveraging what we’ve learned in our earlier lessons, we got to work with map overlays and start answering questions related to proximity. Key concepts include:
st_area,st_lengthst_distancest_intersects,st_intersectionst_within, etc.st_buffer